The Landscape of Machine Learning Models
14 Oct 2021
You may be familiar with, or have experience in, one or more of the following domain areas
With few exceptions, when these terms are used to discuss a learning algorithm, what’s really meant is the field of machine learning
Despite being a broad field, most machine learning algorithms can be subdivided into three main classes
Classes of Machine learning algorithms with applications (reference: https://medium.com/@sanchittanwar75/introduction-to-machine-learning-and-deep-learning-bd25b792e488)
Supervised learning algorithms
Unsupervised learning algorithms
Pertain to data that includes ONLY inputs (features)
Partition (cluster) the data in a meaningful way
Basic idea: We DO NOT HAVE outputs
Examples of when an unsupervised learning algorithm would be used
Example applications of regression algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are numeric and continuous
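To make regression concrete, here is a minimal sketch in R; the simulated data and the true coefficients (slope 2, intercept 1) are illustrative assumptions, not values from this post:

```r
# Simulate a continuous output that depends linearly on the input, plus noise
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 2 * x + 1 + rnorm(50, sd = 0.5)

# Fit a simple linear regression model with R's built-in lm()
fit <- lm(y ~ x)

# The estimated intercept and slope should be close to 1 and 2
coef(fit)
```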
Classification is focused on predicting/labeling a discrete output (categories)
There can be more than two (yes/no) categories
Example applications of classification algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are discrete or categorical
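As a companion sketch for a categorical output, logistic regression (fit here with R's glm()) models a binary yes/no label; again, the simulated data are an illustrative assumption:

```r
# Simulate a binary (yes/no) output whose probability depends on the input
set.seed(42)
x <- runif(200, min = -3, max = 3)
p <- 1 / (1 + exp(-2 * x))       # true probability of the positive class
y <- rbinom(200, size = 1, prob = p)

# Fit a logistic regression model: classification for a two-category output
fit <- glm(y ~ x, family = binomial)

# Threshold the predicted probabilities to obtain discrete class labels
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == y)                  # in-sample accuracy
```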
Before moving forward I want to make sure that you have a solid (yet high level) understanding of how supervised learning algorithms work under the hood
All supervised learning algorithms have the following elements
Suppose you’re asked to create a model to describe the relationship between one set of inputs and one set of outputs
Call the set of inputs \(x\) and the set of outputs \(y\), as shown in the figure below
Plot of some ideal data
Observing the figure, there doesn’t appear to be any uncertainty in the data as each point falls on a straight line
An obvious choice for a model to describe this data would be a function of the form \(y = mx + b\) - the familiar equation for a line
Adding a plot of the line \(y(x) = mx+b\) shows that a “perfect” model exists for our ideal data
Fitting the ideal data with a “perfect” model
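The post doesn't list the underlying data values, but ideal data on the line \(y = 5x + 3\) can be simulated as below (the specific x values are an assumption); a data frame like this df is a stand-in for the one used in the optim() calls later on:

```r
# Noise-free data falling exactly on the line y = 5x + 3
# (the specific x values are an assumption; the post does not list them)
df <- data.frame(x = 1:10)
df$y <- 5 * df$x + 3

# Every point satisfies the line exactly, so all residuals are zero
all(df$y - 5 * df$x - 3 == 0)   # TRUE
```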
Now that we’ve chosen a functional form for \(f_{\text{imperfect}}\) we turn our attention to the parameters of the model
Every model comes with parameters – the values assigned to these parameters affect how well \(f_{\text{imperfect}}\) represents the relationship between x and y
For our chosen \(f_{\text{imperfect}}\), we see that there are two parameters: the slope (\(m\)) and the intercept (\(b\))
The question, of course, is what values of the slope \(m\) and the intercept \(b\) best represent the relationship between x and y
For this ideal data set, we can determine these values using our knowledge of straight lines and our ability to read a plot: the value of the intercept can be read directly from the plot as \(b = 3\)
Using this value for \(b\), we can choose any of the data points and solve for the slope, giving \(m = 5\)
I know what you’re thinking: What does this have to do with machine learning?
A machine learning algorithm learns the relationship between x and y, constraining this relationship to the form of \(f_{\text{imperfect}}\) that we chose above
Loss functions are a key component in all supervised learning algorithms
In many cases, you don't have to choose the loss function used to find the optimal parameter values
Rather, it comes as part of a package deal with the modeling approach you choose (i.e. linear regression, logistic regression, etc.).
You can, however, come up with your own loss function - so long as it produces meaningful results
For our perfect example data, we can choose among several different loss functions
However, not all of them will return an accurate solution
In the sections below I walk through the choice of several different loss functions and plot the results
In this case we use a naive loss function that represents the difference between the observed output and the output returned by the proposed model
The loss function is expressed as shown below
\[ Loss_{_{naive}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N y_i-m\times x_i-b. \]
Using this function, loss would simply be defined as the sum of the signed vertical distances between each observed output \(y_i, i = 1,\ldots,N\) and the output returned by the chosen model
The parameters \(m\) and \(b\) for the best-fit line correspond to the model that has the minimum loss.
For our “ideal” data, the points fall on a straight line and we would expect the loss value in this case to be zero
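We can check that expectation numerically; assuming ideal data on \(y = 5x + 3\), the naive loss evaluated at the true parameter values \(m = 5\), \(b = 3\) is exactly zero:

```r
# Ideal data on the line y = 5x + 3 (the x values are an assumption)
x <- 1:10
y <- 5 * x + 3

# Naive loss: the sum of the signed residuals
naive_loss <- sum(y - 5 * x - 3)
naive_loss   # 0
```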
Thus far, we’ve chosen a functional form that we believe is a good representation of the data – and a corresponding loss function
In the chunk below we define our naive loss function
loss_naive <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  # Sum of signed residuals between observed and predicted outputs
  return(sum(y - m * x - b))
}
Next, we use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_naive, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 2.684172e+55 4.996335e+54
$value
[1] -1.111779e+57
$counts
function gradient
501 NA
$convergence
[1] 1
$message
NULL
Looking at these results, it's clear that something isn't right - why? Before answering, let's try a loss function that sums the absolute values of the residuals instead
\[ Loss_{_{absolute}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big\vert y_i-m\times x_i-b\Big\vert. \]
# First define a function to optimize
loss_absolute <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  # Sum of absolute residuals between observed and predicted outputs
  return(sum(abs(y - m * x - b)))
}
Again, we use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_absolute, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 5 3
$value
[1] 3.301596e-06
$counts
function gradient
121 NA
$convergence
[1] 0
$message
NULL
The problem is that linear functions like our naive loss are unconstrained - positive and negative residuals cancel, and the loss can be driven arbitrarily negative, so there is no minimum to find
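To see the problem concretely, we can evaluate the naive loss at ever larger slopes: because residuals enter the sum with their sign, the total can be driven toward \(-\infty\), so there is no minimum for optim() to find. A sketch, assuming ideal data on \(y = 5x + 3\):

```r
# Ideal data on the line y = 5x + 3 (the x values are an assumption)
x <- 1:10
y <- 5 * x + 3

loss_naive <- function(params, x, y) sum(y - params[1] * x - params[2])

# Increasing the slope makes the naive "loss" arbitrarily negative,
# so the optimizer never settles on a minimum
losses <- sapply(c(5, 50, 500, 5000), function(m) loss_naive(c(m, 3), x, y))
losses   # 0 -2475 -27225 -274725
```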
A better option would be to propose a loss function that is convex, such as
\[ Loss_{_{convex}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big( y_i-m\times x_i-b\Big)^2. \]
Note that we are minimizing the squared distances between the observed values and the proposed model - hence this approach is called least squares
For the last time, let's define our convex loss function
loss_convex <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  # Sum of squared residuals between observed and predicted outputs
  return(sum((y - m * x - b) ^ 2))
}
Once more, we use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_convex, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 4.999800 3.000373
$value
[1] 2.501828e-06
$counts
function gradient
71 NA
$convergence
[1] 0
$message
NULL
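As a cross-check, R's built-in lm() minimizes this same sum of squared residuals in closed form and recovers the slope and intercept exactly (again assuming ideal data on \(y = 5x + 3\)):

```r
# Ideal data on the line y = 5x + 3 (the x values are an assumption)
df <- data.frame(x = 1:10)
df$y <- 5 * df$x + 3

# lm() solves the least-squares problem analytically
fit <- lm(y ~ x, data = df)
coef(fit)   # intercept 3, slope 5
```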
You were introduced to several elements of machine learning models
We examined an application of several different loss functions to an idealized data set and saw that the least-squares loss function provided the best results
We can also expand upon this idealized data set by introducing noise to the data to make it “less ideal”
Finally, we could also change the assumed form of the model that we've chosen to describe the data
It’s important that you recognize that there are often various forms of supervised models that could be used to describe the relationship that exists between the inputs and outputs in a data set
In the following presentations, we'll get into probability distributions and use them to test hypotheses about the world
This will lead us to discussing another method for fitting models to data